The goal of this project is to build machine learning models that will be able to predict an NBA team’s success for a given season. We will be using data from Kaggle, more specifically the NBA Stats (1947-present) database, which was gathered from Basketball Reference by Sumitro Datta. The database contains 17 datasets, and for this project we will be utilizing a merge between “Team Summaries” and “Team Per 100 Poss.”
Before and during each NBA season, analysts are constantly asked the question, “Who are the favorites for this year? Who do you predict will win the championship?” Most of the time, analysts answer with subjective opinion, but what if there were a way to quantify team success and predict future teams based on the numbers? That is the question this project aims to answer.
Now that we have explained the background and significance of this project, let’s discuss how we are going to implement it and analyze its results. Our goal is to classify teams into various tiers of season success, and I will go into more detail later on about what those tiers will be and what they represent. To do this, there will be an extensive data cleaning process to get the dataset in working order to run the models, then we will visualize the predictor variables and see how they affect the outcome classes. We will then run five models: K-Nearest Neighbors, Elastic Net Regression, a Boosted Tree, a Random Forest, and a Neural Network. Whichever model performs the best, we will then continue to explore by running it on our testing data. Let’s get to it!
Rather than per-game data, which was an option for this project, I chose to use per 100 possessions data because of its translatability between eras. I will be making my predictions based on data dating back to the 1980s, a time period when basketball was played at a completely different pace. Using per 100 possessions data helps standardize all of the metrics, giving more accurate predictions.
Time to load in our libraries, each with a specific purpose noted in the comments of the following code chunk.
library(tidymodels) # for running our models
library(tidyverse) # working with data frames
library(pROC) # ROC AUC curves
library(ROCR) # visualize scoring metrics
library(dplyr) # data manipulation
library(corrplot) # correlation matrices
library(kknn) # k nearest neighbors
library(yardstick) # confusion matrices
library(themis) # balancing unbalanced data
library(ranger) # random forest
library(glmnet) # elastic net
library(xgboost) # boosted tree
library(vip) # variable importance
library(ggplot2) # visualizing our data
library(naniar) # missing values plot
team_data = read.csv("Team Data.csv") # loading in our data
head(team_data) # getting an idea of what it looks like
Let’s remove some unnecessary columns that do not help with our
predictions. Columns like games played, minutes played, their arena and
attendance are not important for this project, so they can be taken out.
I will be removing the playoffs column as well, as it is
still “FALSE” for every team this year because the playoffs have not
begun yet, which can skew the data.
This dataset goes back to 1974 but there is some missing data from before 1980, so let’s remove those older dates too.
# removing unnecessary columns
team = subset(team_data, select = -c(X, lg, g, playoffs, abbreviation, mp, pw, pl, arena, attend, attend_g))
team = team %>% filter(season >= 1980) # using data only after 1980, so no missing data
head(team) # previewing data again
Rather than using wins and losses directly, I’d like to consolidate them into a new column named win_pct, which is simply the team’s wins divided by total games played.
# adding win pct column to aid in prediction
team$win_pct = round(team$w / (team$w + team$l), digits = 3)
team = team %>% relocate(win_pct, .before=fg_per_100_poss)
head(team)
This dataset does not have a designated response column, or outcome variable, so let’s create one. To do that, I utilized the Thinking Basketball Podcast system of creating tiers, which classifies teams into 7 tiers, from inner-circle contenders (tier 1) down to tanking teams (tier 7).
We will be utilizing the n_rtg, or net rating, column to create our response. Net rating is one of the better one-number metrics for quantifying how good a team is in a given season, and is calculated by this formula:
Net Rating = Offensive Rating - Defensive Rating = Team points scored per 100 poss - Team points allowed per 100 poss
Although it is not perfect, it will give a great indication of how well each team performed. Let’s take a look at the highest and lowest net ratings to get a gauge of the range of values we’re dealing with.
team = team %>% relocate(n_rtg, .before=fg_per_100_poss)
team_copy = team # making a copy for sorted net ratings
nrtg_sorted = team_copy[order(team_copy$n_rtg, decreasing = TRUE),] # cutoff points for each tier
head(nrtg_sorted) # seeing top few net ratings
tail(nrtg_sorted) # bottom few net ratings
To create tiers, I sorted my data frame in decreasing order of net rating and, based on how teams finished their seasons, created cutoff points for each tier. In this code you will notice me keeping the tiers as numbers; this just makes the upcoming visualizations smoother.
team = team %>%
mutate(outcome = case_when(
n_rtg >= 7.2 ~ "1",
(n_rtg >= 5.8 & n_rtg < 7.2) ~ "2",
(n_rtg >= 2.5 & n_rtg < 5.8) ~ "3",
(n_rtg >= 0.6 & n_rtg < 2.5) ~ "4",
(n_rtg >= -2 & n_rtg < 0.6) ~ "5",
(n_rtg >= -6 & n_rtg < -2) ~ "6",
n_rtg < -6 ~ "7",
TRUE ~ "Uncategorized"
))
team = team %>% relocate(outcome, .after=n_rtg)
tail(team)
Now that we’ve created our win percentage and outcome variables, we no longer want to include our net rating column. The win and loss columns become irrelevant as well, because percentage matters more, especially in seasons where not all 82 games were played. I will drop the age column here too.
# create a new data frame without some more unnecessary columns that were used to create "team"
team_stats = team
team_stats = subset(team_stats, select = -c(n_rtg, w, l, age))
team_stats
team_stats$season = as.factor(team_stats$season)
team_stats$team = as.factor(team_stats$team)
team_stats$outcome = as.factor(team_stats$outcome)
head(team_stats)
Now that we’ve taken out data from before 1980, let’s see if there is any missing data. Turns out, by getting rid of those older dates we got rid of all of the missing data!
vis_miss(team) # no missing data!
Our dataset currently has 42 columns, and excluding team, season and outcome, we have 39 predictors. For a smaller dataset, that is too many, so we will visualize the correlation between predictors and see which we can remove. The method: if the correlation between two predictors exceeds |0.75|, I will choose one of the pair to remove, since two highly correlated predictors do not both need to be included.
team_stats %>% # many predictors are clearly correlated with each other, so several will have to be removed
select(where(is.numeric)) %>%
cor() %>%
corrplot(type = "lower", diag = FALSE)
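Eyeballing the matrix works, but the cutoff rule can also be checked programmatically. Here is a small base R helper (hypothetical, not part of the original analysis) that lists every pair of numeric columns whose absolute correlation exceeds the threshold:

```r
# hypothetical helper: list column pairs with |correlation| above a cutoff,
# so we can decide which member of each pair to drop
high_cor_pairs = function(df, cutoff = 0.75) {
  cm = cor(df[sapply(df, is.numeric)])
  cm[upper.tri(cm, diag = TRUE)] = NA   # keep each pair only once
  idx = which(abs(cm) > cutoff, arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, 1]],
             var2 = colnames(cm)[idx[, 2]],
             cor  = round(cm[idx], 3))
}

# e.g. high_cor_pairs(team_stats)
```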
Based on this matrix, we will remove the following columns:
team_info = subset(team_stats, select = -c(x3pa_per_100_poss, x2p_per_100_poss, x2pa_per_100_poss, fta_per_100_poss, mov, srs, o_rtg, ft_per_100_poss, ts_percent, e_fg_percent, tov_percent, orb_percent, ft_fga, opp_e_fg_percent, opp_drb_percent, opp_ft_fga, x3p_ar, fg_per_100_poss, x2p_percent, opp_tov_percent))
Now let’s check whether all of the remaining pairwise correlations are below positive or negative 0.75. And as we see, everything looks good! Some variables are still correlated with each other, which makes sense, but not to the point where we need to worry and remove them.
team_info %>% # now no correlations over |0.75|
select(where(is.numeric)) %>%
cor() %>%
corrplot(type = "lower", diag = FALSE, addCoef.col = "black", number.cex = 0.6)
Now that we have trimmed down some of the extra data, we have the following predictors:
If a predictor has _per_100_poss in its name, it records that stat per 100 possessions, which is rather self-explanatory, but I thought I’d give a brief explanation of each individual metric in the list below.
win_pct: Team’s win percentage for the given regular season.
fga_per_100_poss: Field goals attempted (everything excluding free throws)
fg_percent: Field goal percentage
x3p_per_100_poss: Three-pointers made (attempts, x3pa, was one of the removed correlated columns, so makes serve as the volume stat)
x3p_percent: Three-point percentage
ft_percent: Free throw percentage
orb_per_100_poss: Offensive rebounds
drb_per_100_poss: Defensive rebounds
trb_per_100_poss: Total rebounds, surprisingly not highly correlated with the previous two
ast_per_100_poss: Assists
stl_per_100_poss: Steals
blk_per_100_poss: Blocks
tov_per_100_poss: Turnovers
pf_per_100_poss: Personal fouls committed
pts_per_100_poss: Points the team scores
sos: Strength of schedule, calculated from the win percentages of opponents
d_rtg: Defensive rating, or points allowed per 100 possessions (I’ll specify here because it is a new stat)
pace: The reason for using per 100 possessions data, as pace varies greatly between eras. It is calculated as the number of possessions that occur during a 48-minute game.
f_tr: Free throw rate, calculated as free throws attempted per field goal attempt.
Now that we are done cleaning up our dataset, let’s start digging a little deeper and taking a look at how some of our predictor variables impact our outcome! I probably included more plots than necessary, but it was fascinating to see how everything interacts.
team_info %>% # barplot showing distribution of each tier
ggplot(aes(x = outcome, fill = outcome)) +
geom_bar() +
labs(x = "Team Outcome") +
scale_fill_brewer(palette = "OrRd", direction = -1)
As expected, there are far fewer teams in tiers 1 and 2 (inner circle and dark horse contenders), as those two groups are much more difficult to reach. The bulk of team outcomes fall between tiers 3 and 6.
team_info %>%
ggplot(aes(fg_percent)) +
geom_bar(aes(fill = outcome), width = 0.001) +
labs(x = "Field Goal Percentage") +
scale_fill_brewer(palette = "OrRd", direction = -1)
Similar to many of the plots, field goal percentage and outcome seem to be heavily related, with most of the inner circle title contending teams being on the higher end of the spectrum.
team_info %>%
ggplot(aes(win_pct)) +
geom_bar(aes(fill = outcome), width = 0.01) +
labs(x = "Win Percentage") +
scale_fill_brewer(palette = "OrRd", direction = -1)
## Warning: `position_stack()` requires non-overlapping x intervals
The win percentage plot is not surprising either, but it is the most resounding result of all the plots. There is a clear separation between teams and their outcomes, which makes sense: though this is a regular season dataset, regular season success is a strong indicator of how a team will do later in the season. Sorry, 2007 Mavericks.
team_info %>%
ggplot(aes(x = pts_per_100_poss, y = d_rtg, color = outcome)) +
geom_point() +
labs(x = "Points Per 100 Possessions (Off. Rating)", y = "Defensive Rating") +
scale_color_brewer(palette = "OrRd", direction = -1)
Here I created a scatter plot with pts_per_100_poss, or offensive rating, on the x-axis and defensive rating on the y-axis. The bottom right of the graph is where the elite teams live, as those teams have both elite offenses and elite defenses (a low defensive rating is good). However, there are some outliers; for example, there is a point on the far right that sits fairly high up on the graph. This would be an example of a team with a historically great offense but only an average defense.
After looking into the data, that point is actually the Boston Celtics of this year, who are on pace to become the greatest offense of all time. Their defense only looks mediocre on this graph, but they are actually the number 2 ranked defense, which just goes to show how much the game has changed over the years. Offenses are historically great compared to past eras, so the average defensive rating has gone up significantly.
team_info %>%
ggplot(aes(pace)) +
geom_bar(aes(fill = outcome), width = 0.3) +
labs(x = "Pace") +
scale_fill_brewer(palette = "OrRd", direction = -1)
## Warning: `position_stack()` requires non-overlapping x intervals
Pace seems to be the most evenly distributed metric so far, one that isn’t a pure giveaway as to its impact on the team’s outcome. Perhaps, though, like I mentioned earlier, this could have something to do with the era a team played in. Let’s look into that a little more by plotting pace against season.
team_info %>%
ggplot(aes(x = season, y = pace, color = outcome)) +
geom_point() +
scale_color_brewer(palette = "OrRd", direction = -1) +
labs(x = "Season", y = "Pace") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Surprisingly, there is no significant giveaway as to how much pace correlates to team winning. It is quite interesting to see how much the average pace of play has changed, with historic pace in the 80s then the game slowing down in the 90s and 00s, then coming back up today.
team_info %>%
ggplot(aes(x = x3p_percent, y = x3p_per_100_poss, color = outcome)) +
geom_point() +
labs(x = "3-Point Percentage", y = "3-Pointers Made Per 100") +
scale_color_brewer(palette = "OrRd", direction = -1)
Maybe it’s recency bias, but I expected 3-point shooting to have much more of an impact on outcome than it seems to here. Though many elite teams sit towards the higher end of percentage, 3-point volume does not seem to have as significant an impact as I expected. How about if I put season on the x-axis and volume on the y?
team_info %>%
ggplot(aes(x = season, y = x3p_per_100_poss, color = outcome)) +
geom_point() +
scale_color_brewer(palette = "OrRd", direction = -1) +
labs(x = "Season", y = "3-Pointers Made Per 100") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Well, it is clear that league-wide 3-point volume has increased over the years, but there does not seem to be a clear indication of a direct impact on winning.
Assist to turnover ratio is a widely used statistic, but it is mainly associated with players, not teams. It signifies how well a player, or in this case, team, is able to share the ball while not giving the opponent extra possessions.
team_info %>%
ggplot(aes(x = ast_per_100_poss, y = tov_per_100_poss, color = outcome)) +
geom_point() +
labs(x = "Assists Per 100", y = "Turnovers Per 100") +
scale_color_brewer(palette = "OrRd", direction = -1)
From this graph it does look like the teams on the higher end of assists and lower end on turnovers are the better teams! No wonder every analyst mentions this ratio so often.
team_info %>%
ggplot(aes(x = drb_per_100_poss, y = orb_per_100_poss, color = outcome)) +
geom_point() +
labs(x = "Defensive Rebounds Per 100", y = "Offensive Rebounds Per 100") +
scale_color_brewer(palette = "OrRd", direction = -1)
Before this project I would have expected offensive and defensive rebounding to be positively correlated, but that does not seem to be the case. However when it comes to the distribution of our response variable, outcome, in this scatter plot, there does not seem to be much association.
Strength of schedule is a metric that takes into account a team’s opponent win percentage throughout a season.
team_info %>%
ggplot(aes(sos)) +
geom_bar(aes(fill = outcome), width = 0.05) +
labs(x = "Strength of Schedule") +
scale_fill_brewer(palette = "OrRd", direction = -1)
## Warning: `position_stack()` requires non-overlapping x intervals
Another one I didn’t expect. I would have thought that strength of schedule did not matter very much in predicting a team’s outcome, and that elite teams would be able to beat any team on their schedule. While that may still be true, this graph seems to indicate that the weaker the opponents, the better the team outcome. Intuitively that makes some sense: a strong team never plays itself, which lowers its average opponent win percentage.
team_info %>%
ggplot(aes(x = f_tr, y = pf_per_100_poss, color = outcome)) +
geom_point() +
labs(x = "Free Throw Rate", y = "Personal Fouls Per 100") +
scale_color_brewer(palette = "OrRd", direction = -1)
This distribution seems pretty evenly spread, and again, not one I really saw coming. Being able to draw fouls and limit the fouls you commit is a very important skill for most NBA teams, but apparently not important enough to have much of an impact on team success. There is slight evidence that committing fewer fouls helps, but free throw rate does not seem to matter much.
team_info %>%
ggplot(aes(x = (stl_per_100_poss + blk_per_100_poss), y = tov_per_100_poss, color = outcome)) +
geom_point() +
labs(x = "Steals + Blocks Per 100", y = "Turnovers Per 100") +
scale_color_brewer(palette = "OrRd", direction = -1)
Similar to the assist-to-turnover ratio graph, the balance between turnovers committed and takeaways forced (steals and blocks) is another important one for team success. This graph reflects that, as teams in the bottom right tend to land in the better tiers of overall success.
Now that we have visualized all of our data, it is finally time to build our models and put them to the test.
Before running any models, we need to randomly split our data into training and testing sets. I have a smaller dataset, so I felt that a 75/25 split would be good. Of course, we are stratifying on our outcome variable, outcome. Setting a seed is important because it ensures that we get the same random split every time we run the models, making the results reproducible.
set.seed(421)
team_split = team_info %>%
initial_split(prop = 0.75, strata = "outcome")
team_train = training(team_split)
team_test = testing(team_split)
dim(team_train)
## [1] 938 22
dim(team_test)
## [1] 316 22
Building a recipe helps gather all of the predictors and response necessary to build the models, and put them in one place for future use. In this recipe I will be using all the predictors I listed earlier. Below you can see the full thing.
set.seed(421)
team_recipe = recipe(outcome ~ win_pct + fga_per_100_poss + fg_percent + x3p_per_100_poss + x3p_percent + ft_percent + orb_per_100_poss + drb_per_100_poss + trb_per_100_poss + ast_per_100_poss + stl_per_100_poss + blk_per_100_poss + tov_per_100_poss + pf_per_100_poss + pts_per_100_poss + sos + d_rtg + pace + f_tr, data = team_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_predictors())
prep(team_recipe) %>% bake(team_train)
Because our data is unbalanced, we will stratify our cross-validation folds on the response variable, outcome.
team_folds = vfold_cv(team_train, v = 10, strata = outcome) # 10-fold CV
I will be saving the data from my main file and running my models in separate .rmd files, then loading them back in to save time and avoiding running them multiple times.
save(team_folds, team_recipe, team_train, team_test, file = "modeling-setup.rda")
The most important step of this project is building the models themselves. This step took the longest amount of time, as some of the models took up to 30 minutes to run, which is a small number compared to some of my peers, who had models with tens of thousands of observations compared to my measly one thousand. In the setting of multiclass classification, the best metric to use is roc_auc, the area under the receiver operating characteristic curve. A perfect model has a value of 1, so that is the goal with each model. I will be tuning and running five models: K-Nearest Neighbors, an Elastic Net Regression, a Boosted Tree, a Random Forest, and a Neural Network. The steps for each model are very similar, and began with building the recipe and performing stratified sampling. Following that, the steps are:

1. Set up the model specification, flagging the parameters to be tuned, and set its engine and mode.
2. Create a workflow and add the model and recipe to it.
3. Create a tuning grid, specifying the ranges for each tuned parameter.
4. Tune the model across the grid using the 10-fold cross-validation folds.
5. Collect the results, selecting the best model by the roc_auc metric.

Now that we have tuned our grids, let’s load them back into the main file and analyze their performance.
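As a concrete sketch of those steps, here is roughly what the K-Nearest Neighbors tuning might have looked like. This is my reconstruction, not the actual contents of the separate .rmd files, and object names like knn_grid are assumptions:

```r
# knn model spec: tune the number of neighbors (step 1)
knn_model = nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# workflow combining model and recipe (step 2)
knn_workflow = workflow() %>%
  add_model(knn_model) %>%
  add_recipe(team_recipe)

# grid of candidate neighbor counts from 1 to 100 (step 3)
knn_grid = grid_regular(neighbors(range = c(1, 100)), levels = 10)

# tune across the stratified 10-fold CV created earlier (steps 4-5)
team_knn_fit = tune_grid(knn_workflow,
                         resamples = team_folds,
                         grid = knn_grid,
                         metrics = metric_set(roc_auc))
```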
load("KNN.rda")
load("TeamElasticNet.rda")
load("TeamRandomForest.rda")
load("BoostedTree.rda")
load("TeamNeuralNet.rda")
Autoplots are a great way to visualize the performance of our models,
and can help us choose which of our tuned parameters performed best. We
will use the autoplot function in R to visualize the metric,
roc_auc.
K-Nearest Neighbors is a model that classifies a data point based on, well, its nearest neighbors. It seems excessive, but I tuned the number of neighbors between 1 and 100 and found which value gave the best fit.
autoplot(team_knn_fit) + theme_minimal()
I then found that somewhere just under 40 neighbors was the best
performing model, with a roc_auc just under 0.9.
Elastic Net Regression is a model that finds a middle ground between lasso and ridge regression, two forms of regression that shrink coefficients toward zero. In this model there are two parameters that need to be tuned, penalty and mixture.
autoplot(tune_team_elastic) + theme_minimal()
The results of the best model were penalty = 0.01 and
mixture = 0.778, and they gave a roc_auc over
0.95, which is very good.
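For reference, here is a sketch of how that tuning may have been set up. multinom_reg() handles our seven-class outcome with the glmnet engine; the grid ranges are my assumptions (though a 10-level mixture grid from 0 to 1 would produce values like the 0.778 reported above):

```r
# elastic net for a multiclass outcome: penalty = amount of shrinkage,
# mixture = blend between ridge (0) and lasso (1)
elastic_model = multinom_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

elastic_workflow = workflow() %>%
  add_model(elastic_model) %>%
  add_recipe(team_recipe)

# penalty is tuned on a log10 scale by default
elastic_grid = grid_regular(penalty(range = c(-5, 5)),
                            mixture(range = c(0, 1)),
                            levels = 10)

tune_team_elastic = tune_grid(elastic_workflow,
                              resamples = team_folds,
                              grid = elastic_grid)
```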
Random forests are models that are combinations of many smaller decision trees, and utilize random sampling to take different combinations of predictors to make predictions. In this model the parameters used are:
trees: Number of trees used. I ran my model on anywhere between 100 and 800.
mtry: Number of predictors randomly sampled at each split. My model had 19 predictors, so my range was 1 to 19.
min_n: Minimal node size, or the minimum number of observations required in each node of a tree. I chose values between 10 and 20.

autoplot(tune_team_rf) + theme_minimal()
The tree with mtry = 10, trees = 800 and min_n = 10 was the best model, with a roc_auc value close to 0.96.
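A sketch of how the random forest may have been specified with those parameter ranges (again my reconstruction; the actual separate file may differ):

```r
# random forest: ranger engine; importance = "impurity" records
# variable importance while fitting
rf_model = rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_workflow = workflow() %>%
  add_model(rf_model) %>%
  add_recipe(team_recipe)

rf_grid = grid_regular(mtry(range = c(1, 19)),
                       trees(range = c(100, 800)),
                       min_n(range = c(10, 20)),
                       levels = 8)

tune_team_rf = tune_grid(rf_workflow, resamples = team_folds, grid = rf_grid)
```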
Boosted trees are another way of combining the results of individual trees, however the mechanism is slightly different for a boosted tree. Instead of combining trees in parallel, boosted trees do their work sequentially, each tree learning from the last and eventually coming to the best model. The parameters for this are:
mtry: Same as the random forest. I chose 1 to 19 as my range.
trees: Same as well. I chose 100 to 800 again.
learn_rate: How quickly the model learns. I chose values between -10 and -1, which are scaled as exponents of 10, so a value of -1 means 0.1, and so forth.

autoplot(tune_team_bt) + theme_minimal()
The best model was one with mtry = 5, trees
= 100, and learn_rate = 0.1.
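Likewise, a sketch of the boosted tree setup with those ranges (a reconstruction; in dials, the learn_rate range is specified in powers of 10):

```r
# boosted tree: xgboost engine; learn_rate(range = c(-10, -1))
# spans learning rates from 1e-10 up to 0.1
bt_model = boost_tree(mtry = tune(), trees = tune(), learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

bt_workflow = workflow() %>%
  add_model(bt_model) %>%
  add_recipe(team_recipe)

bt_grid = grid_regular(mtry(range = c(1, 19)),
                       trees(range = c(100, 800)),
                       learn_rate(range = c(-10, -1)),
                       levels = 5)

tune_team_bt = tune_grid(bt_workflow, resamples = team_folds, grid = bt_grid)
```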
autoplot(tune_nnet_grid) + theme_minimal()
The neural network did not perform as well as some of the other models, with its best ROC AUC score being around 0.70.
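The neural network’s setup also lives in a separate file; a plausible tidymodels version might use a single-layer perceptron via mlp(). The engine choice and parameter ranges below are purely my assumptions:

```r
# hypothetical single-layer neural network spec; the actual
# TeamNeuralNet file may have used different parameters or engine
nnet_model = mlp(hidden_units = tune(), penalty = tune(), epochs = tune()) %>%
  set_engine("nnet") %>%
  set_mode("classification")

nnet_workflow = workflow() %>%
  add_model(nnet_model) %>%
  add_recipe(team_recipe)

nnet_grid = grid_regular(hidden_units(range = c(1, 10)),
                         penalty(range = c(-5, 0)),   # log10 scale
                         epochs(range = c(100, 500)),
                         levels = 3)

tune_nnet_grid = tune_grid(nnet_workflow, resamples = team_folds,
                           grid = nnet_grid)
```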
To compare the performance of each model, I created a tibble containing the ROC AUC scores of each, giving a clear view of how well each model did.
knn_roc_auc = augment(final_knn, new_data = team_train) %>%
roc_auc(outcome, .pred_1:.pred_7) %>%
select(.estimate)
elastic_roc_auc = augment(final_elastic, new_data = team_train) %>%
roc_auc(outcome, .pred_1:.pred_7) %>%
select(.estimate)
bt_roc_auc = augment(final_bt, new_data = team_train) %>%
roc_auc(outcome, .pred_1:.pred_7) %>%
select(.estimate)
rf_roc_auc = augment(final_rf, new_data = team_train) %>%
roc_auc(outcome, .pred_1:.pred_7) %>%
select(.estimate)
nn_roc_auc = augment(final_nn, new_data = team_train) %>%
roc_auc(outcome, .pred_1:.pred_7) %>%
select(.estimate)
roc_auc_values = c(knn_roc_auc$.estimate,
elastic_roc_auc$.estimate,
bt_roc_auc$.estimate,
rf_roc_auc$.estimate,
nn_roc_auc$.estimate)
model_names = c("K-Nearest Neighbors",
"Elastic Net",
"Boosted Tree",
"Random Forest",
"Neural Network")
final_results = tibble(Model = model_names,
ROC_AUC = roc_auc_values)
final_results %>%
arrange(desc(ROC_AUC))
From these training-set scores, it is clear that all models ran extremely well, with none having roc_auc values under 0.95! However, the boosted tree did the best, and we will be using it going forward to run on our testing data and see how well it truly did.
From the tibble above, it was clear that the boosted tree performed the best out of all the models. Now let’s identify its best set of tuning parameters and run it on our testing data.
show_best(tune_team_bt, n = 1, metric = "roc_auc") %>%
select(-.estimator, .config)
Out of the hundreds of boosted tree candidates, the 25th configuration performed the best. You can see its parameters above.
Now let’s run our model on our testing data and see how well it performed.
final_bt_testing = augment(final_bt, team_test) %>%
select(outcome, starts_with(".pred"))
roc_auc(final_bt_testing, truth = outcome, .pred_1:.pred_7)
Very good! A 0.95 roc_auc value on our testing data is a
fantastic score, and a sign that our models did a great job predicting
the outcome.
Now that we’ve seen how well our boosted tree model ran on both training and testing data, let’s take a look at some graphs and visualizations for more specifics about its performance.
Let’s see which predictors were most important in its performance.
final_bt %>%
extract_fit_parsnip() %>%
vip() + theme_minimal()
It seems that win percentage played a significant role in predicting our outcome variable, which makes sense given how correlated win percentage and net rating are with overall team success. The next most important predictors were pts_per_100_poss, or offensive rating, and defensive rating. These also make sense, as net rating is calculated from these two.
Let’s take a look at ROC curves showing how well our model predicted each individual outcome tier. For reference, a perfect curve looks like a 90-degree angle.
final_bt_testing %>%
roc_curve(outcome, .pred_1:.pred_7) %>%
autoplot()
In our case the best prediction was tier 7, or tanking, and the worst was tier 4, or fringe playoff teams.
A confusion matrix shows how well the model did in predicting each class, with one side being actual values and one side being predicted.
conf_mat(final_bt_testing, truth = outcome, .pred_class) %>%
autoplot(type = "heatmap")
As the ROC curves suggested, the model was able to predict our tanking teams at a very high rate, but struggled a fair bit with fringe playoff teams, placing many of them a tier too high or too low.
For this section of the project, I did not want to cherry-pick some of the easiest-to-predict teams of all time, those with the most extreme data points either high (2017 Warriors, 1996 Bulls) or low (2024 Wizards, 2012 Bobcats). To avoid this, I spun a wheel of NBA teams and then ran a random number generator for the season, making the choice completely random. I did this for two teams, then picked another two teams that I thought might have a chance of not being predicted well due to outside factors.
To make these predictions, I created a 1x19 data frame containing each team’s stats from that season and told my model to predict the outcome. Here are the results!
Actual: Dangerous team (tier 3)
team_info %>% filter(team == "Detroit Pistons", season == "2005")
The Pistons in 2005 performed much better than their numbers predicted, finishing first in the Eastern Conference and making it all the way to the NBA Finals before losing to the San Antonio Spurs.
Their actual value is tier 3, let’s see what the model predicted.
pistons = data.frame(
win_pct = 0.659,
fga_per_100_poss = 88.6,
fg_percent = 0.444,
x3p_per_100_poss = 5,
x3p_percent = 0.345,
ft_percent = 0.739,
orb_per_100_poss = 14.5,
drb_per_100_poss = 34.6,
trb_per_100_poss = 49.1,
ast_per_100_poss = 24.7,
stl_per_100_poss = 7.9,
blk_per_100_poss = 6.9,
tov_per_100_poss = 15.6,
pf_per_100_poss = 22.6,
pts_per_100_poss = 105.6,
sos = -0.55,
d_rtg = 101.2,
pace = 87.2,
f_tr = 0.335
)
predict(final_bt, new_data = pistons, type = "class")
Nice! Our model got it correct. The Pistons that year were much more than dangerous; maybe they should have been placed in a higher tier than they were.
Actual: Play-in team (tier 5)
team_info %>% filter(team == "Dallas Mavericks", season == "2000")
The Mavericks in 2000 were not a great team, despite having two young future Hall of Famers in Steve Nash and Dirk Nowitzki. They went 40-42 and missed the playoffs entirely. However, if the play-in had existed at the time, they would have been right there, which is where the numbers have them. Let’s see what the model says.
mavs = data.frame(
win_pct = 0.488,
fga_per_100_poss = 90.3,
fg_percent = 0.453,
x3p_per_100_poss = 6.7,
x3p_percent = 0.391,
ft_percent = 0.804,
orb_per_100_poss = 11.9,
drb_per_100_poss = 31.3,
trb_per_100_poss = 43.3,
ast_per_100_poss = 23.2,
stl_per_100_poss = 7.6,
blk_per_100_poss = 5.3,
tov_per_100_poss = 14.4,
pf_per_100_poss = 22.7,
pts_per_100_poss = 106.6,
sos = 0.29,
d_rtg = 107.2,
pace = 94.9,
f_tr = 0.248
)
predict(final_bt, new_data = mavs, type = "class")
Another success! The model got this one right as well.
Now let’s try two teams from this past season that either underperformed or overperformed relative to their regular season data, which is all that this dataset uses.
Actual: Play-in Team (tier 5)
team_info %>% filter(team == "Miami Heat", season == "2023")
The 2023 Miami Heat massively disappointed in the regular season, going from a team that finished as the 1 seed in 2022 to the 7 seed entering the playoffs. Let’s see what the model thinks about their regular season numbers.
heat = data.frame(
win_pct = 0.537,
fga_per_100_poss = 88,
fg_percent = .46,
x3p_per_100_poss = 12.3,
x3p_percent = .344,
ft_percent = .831,
orb_per_100_poss = 10,
drb_per_100_poss = 31.9,
trb_per_100_poss = 41.9,
ast_per_100_poss = 24.6,
stl_per_100_poss = 8.2,
blk_per_100_poss = 3.1,
tov_per_100_poss = 13.9,
pf_per_100_poss = 19.1,
pts_per_100_poss = 113,
sos = .18,
d_rtg = 113.3,
pace = 96.3,
f_tr = .27
)
predict(final_bt, new_data = heat, type = "class")
Our model got it correct! However, because we are using regular season data, the model cannot account for the fact that Jimmy Butler turns into prime Michael Jordan in the playoffs; he willed his team to the NBA Finals before losing to the Denver Nuggets in six games.
Let’s take a look at one more interesting case from last season.
Actual: Dangerous team (tier 3)
team_info %>% filter(team == "Milwaukee Bucks", season == "2023")
The Bucks in 2023 were the top seeded team in the Eastern Conference going into the playoffs, and were viewed by most as an inner-circle title contender who would walk to at least the Eastern Conference Finals.
This could be a case where the data knows more than the analysts do, as it only has them as a team that could possibly have made a run, rather than an actual contender. Let’s see what our model says.
bucks = data.frame(
win_pct = 0.707,
fga_per_100_poss = 89.2,
fg_percent = 0.473,
x3p_per_100_poss = 14.7,
x3p_percent = 0.368,
ft_percent = 0.743,
orb_per_100_poss = 11,
drb_per_100_poss = 37,
trb_per_100_poss = 48,
ast_per_100_poss = 25.5,
stl_per_100_poss = 6.3,
blk_per_100_poss = 4.9,
tov_per_100_poss = 14.4,
pf_per_100_poss = 17.8,
pts_per_100_poss = 115.4,
sos = -0.02,
d_rtg = 111.9,
pace = 100.5,
f_tr = .248
)
predict(final_bt, new_data = bucks, type = "class")
In this case, the model differed from our tier assignment and sided more with the analysts, viewing the Bucks as a contender. However, the Bucks infamously collapsed and lost in the first round to the aforementioned Miami Heat, so having some playoff data alongside the regular season data would likely improve this model’s performance at judging full-season outcomes.
Through the process of this project, with extensive research, model building and analysis, my finding is that the best model for predicting team success in the NBA is a boosted tree. Though it was not perfect, it did a great job with the data it was given and provides a good baseline prediction.
As far as possible improvements are concerned, I think the models ran fine; I just believe including some sort of playoff data would help improve the overall prediction. It is unfortunate that I had to remove the playoffs column from my data frame, as data from this season would be incorrect if I included it. Maybe I will revisit this project in the NBA offseason, after the dataset has been updated by its creator to include the rest of the playoffs.
I also experimented with the inclusion of season as a
dummy variable, but that dropped many of my metrics down by at least 10
percent, so only running my models to include numeric data was the best
option.
It will not be a part of this project, but I plan on remaking this entire project in Python, to try implementing a neural network there and see how it compares to the boosted tree, as well as gain more experience with Scikit-learn and its many capabilities.
Overall, this project was a great opportunity for me to build experience with building models and machine learning as a whole, and the freedom to choose my topic made me even more enthusiastic about the project!